A Call for Standardization and Validation of Text Style Transfer Evaluation
Text Style Transfer (TST) evaluation is, in practice, inconsistent.
We therefore conduct a meta-analysis of human and automated TST evaluation and
experimentation that thoroughly examines the existing literature in the field. The
meta-analysis reveals a substantial standardization gap in human and automated
evaluation. We also find a validation gap: only a few automated
metrics have been validated using human experiments. We therefore
scrutinize both the standardization and validation gaps and reveal the resulting
pitfalls. This work also paves the way to closing the standardization and
validation gaps in TST evaluation by laying out requirements to be met by
future research.
Comment: Accepted to Findings of ACL 202
Text Style Transfer Evaluation Using Large Language Models
Evaluating Text Style Transfer (TST) is a complex task due to its
multifaceted nature. The quality of the generated text is measured based on
challenging factors, such as style transfer accuracy, content preservation, and
overall fluency. While human evaluation is considered to be the gold standard
in TST assessment, it is costly and often hard to reproduce. Therefore,
automated metrics are prevalent in these domains. Nevertheless, it remains
unclear whether these automated metrics correlate with human evaluations.
Recent strides in Large Language Models (LLMs) have showcased their capacity to
match and even exceed average human performance across diverse, unseen tasks.
This suggests that LLMs could be a feasible alternative to human evaluation and
other automated metrics in TST evaluation. We compare the results of different
LLMs on TST using multiple input prompts. Our findings show a strong
correlation between LLM judgments (even with zero-shot prompting) and human
evaluation, and that LLMs often outperform traditional automated metrics. Furthermore, we
introduce the concept of prompt ensembling, demonstrating its ability to
enhance the robustness of TST evaluation. This research contributes to the
ongoing evaluation of LLMs in diverse tasks, offering insights into successful
outcomes and areas of limitation.
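The prompt-ensembling idea described above can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the prompt templates, the placeholder `llm_rate` function, and the median aggregation are all assumptions.

```python
# Hypothetical sketch of prompt ensembling for TST evaluation:
# several prompt templates rate the same output, and the scores
# are aggregated (here by median) into a more robust judgment.
from statistics import median

PROMPTS = [
    "Rate the style transfer accuracy of this rewrite from 1 to 5: {text}",
    "On a 1-5 scale, how well does this rewrite match the target style? {text}",
    "Score (1-5) how convincingly the style was transferred: {text}",
]

def llm_rate(prompt: str) -> int:
    """Placeholder for a real LLM call; returns a 1-5 rating."""
    # A real implementation would query an LLM and parse its answer.
    return 4

def ensemble_score(text: str) -> float:
    scores = [llm_rate(p.format(text=text)) for p in PROMPTS]
    return median(scores)

print(ensemble_score("The food was not good."))  # -> 4
```

Aggregating over several phrasings of the same evaluation question is what makes the ensemble less sensitive to any single prompt's quirks.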
Ordinal Regression for Difficulty Estimation of StepMania Levels
StepMania is a popular open-source clone of a rhythm-based video game. As is
common in popular games, there is a large number of community-designed levels.
It is often difficult for players and level authors to determine the difficulty
level of such community contributions. In this work, we formalize and analyze
the difficulty prediction task on StepMania levels as an ordinal regression
(OR) task. We standardize a more extensive and diverse selection of this data,
resulting in five data sets, two of which are extensions of previous work. We
evaluate many competitive OR and non-OR models, demonstrating that neural
network-based models significantly outperform the state of the art and that
StepMania-level data makes for an excellent test bed for deep OR models. We
conclude with a user experiment showing our trained models' superiority over
human labeling.
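The abstract does not specify a particular OR model, but a common reduction of ordinal regression to binary classification, sketched below under that assumption, conveys the idea: K ordered difficulty levels become K-1 binary questions of the form "is the difficulty greater than level k?", and the predicted level is one plus the number of positive answers. The probabilities are illustrative.

```python
# Illustrative reduction of ordinal regression (OR) to binary tasks,
# in the spirit of predicting one of K ordered difficulty levels.
# p_greater[k] stands for a binary model's estimate of P(level > k+1);
# the predicted level is 1 plus the count of confident "greater" answers.

def predict_level(p_greater: list[float], threshold: float = 0.5) -> int:
    """p_greater[k] = P(level > k+1) for k = 0..K-2."""
    return 1 + sum(p > threshold for p in p_greater)

# Example: 5 difficulty levels, hence 4 binary probabilities.
print(predict_level([0.95, 0.80, 0.40, 0.10]))  # -> 3
```

This decomposition respects the ordering of the labels, which a plain multi-class classifier ignores.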
Evaluating Dynamic Topic Models
There is a lack of quantitative measures to evaluate the progression of
topics through time in dynamic topic models (DTMs). Filling this gap, we
propose a novel evaluation measure for DTMs that analyzes the changes in the
quality of each topic over time. Additionally, we propose an extension
combining topic quality with the model's temporal consistency. We demonstrate
the utility of the proposed measure by applying it to synthetic data and data
from existing DTMs. We also conduct a human evaluation, which indicates that
the proposed measure correlates well with human judgment. Our findings may help
in identifying changing topics, evaluating different DTMs, and guiding future
research in this area.
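One way the combination of per-timestep topic quality with temporal consistency might look is sketched below. Both the quality values and the combination rule (product of mean quality and mean top-word overlap) are assumptions for illustration, not the paper's actual measure.

```python
# Hypothetical sketch: score a dynamic topic by combining per-timestep
# quality (e.g., a coherence score) with temporal consistency, measured
# here as the Jaccard overlap of top words between consecutive steps.

def temporal_consistency(top_words: list[list[str]]) -> float:
    """Mean Jaccard overlap of a topic's top words across consecutive steps."""
    overlaps = []
    for a, b in zip(top_words, top_words[1:]):
        sa, sb = set(a), set(b)
        overlaps.append(len(sa & sb) / len(sa | sb))
    return sum(overlaps) / len(overlaps)

def combined_score(quality_per_step: list[float],
                   top_words: list[list[str]]) -> float:
    avg_quality = sum(quality_per_step) / len(quality_per_step)
    return avg_quality * temporal_consistency(top_words)

words = [["economy", "market", "trade"],
         ["economy", "market", "tariff"],
         ["economy", "tariff", "deficit"]]
print(round(combined_score([0.6, 0.5, 0.7], words), 3))  # -> 0.3
```

A topic that drifts gradually keeps high overlap between adjacent steps, so its quality is discounted less than one that changes abruptly.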
Learning to Play Text-based Adventure Games with Maximum Entropy Reinforcement Learning
Text-based games are a popular testbed for language-based reinforcement
learning (RL). In previous work, deep Q-learning is commonly used as the
learning agent. Q-learning algorithms are challenging to apply to complex
real-world domains due to, for example, their instability in training.
Therefore, in this paper, we adapt the soft actor-critic (SAC) algorithm to the
text-based environment. To deal with sparse extrinsic rewards from the
environment, we combine it with a potential-based reward shaping technique to
provide more informative (dense) reward signals to the RL agent. We apply our
method to play difficult text-based games. The SAC method achieves higher
scores than the Q-learning methods on many games with only half the number of
training steps. This shows that it is well-suited for text-based games.
Moreover, we show that the reward shaping technique helps the agent to learn
the policy faster and achieve higher scores. In particular, we consider a
dynamically learned value function as a potential function for shaping the
learner's original sparse reward signals.
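Potential-based reward shaping, as used above, has a standard form: the shaped reward adds gamma * phi(s') - phi(s) to the environment reward, which densifies the signal without changing the optimal policy. The sketch below assumes a toy potential function; the paper uses a dynamically learned value function instead.

```python
# Sketch of potential-based reward shaping. The shaped reward adds
# gamma * phi(s_next) - phi(s) to the sparse environment reward.
# The toy phi below is illustrative; the abstract describes using a
# dynamically learned value function as the potential.

GAMMA = 0.99

def phi(state: int) -> float:
    """Toy potential: a stand-in for a learned state-value estimate."""
    return float(state)  # e.g., progress toward the goal

def shaped_reward(r: float, s: int, s_next: int) -> float:
    return r + GAMMA * phi(s_next) - phi(s)

# Sparse reward 0, but moving from state 3 to 4 yields a dense signal.
print(shaped_reward(0.0, 3, 4))  # ~0.96
```

Because the shaping term telescopes along any trajectory, it rewards progress at every step while leaving the ranking of policies intact.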
Discriminative machine learning for maximal representative subsampling
Biased population samples pose a prevalent problem in the social sciences. We therefore present two novel methods, based on positive-unlabeled learning, to mitigate bias. Both methods leverage auxiliary information from a representative data set and train machine learning classifiers to determine the sample weights. The first method, named maximum representative subsampling (MRS), uses a classifier to iteratively remove instances from the biased data set, by assigning them a sample weight of 0, until it aligns with the representative one. The second method, Soft-MRS, is a variant of MRS that iteratively adapts sample weights instead of removing samples entirely. To assess the effectiveness of our approach, we induce artificial bias in a public census data set and examine the corrected estimates. We compare the performance of our methods against existing techniques, evaluating the ability of sample weights created with Soft-MRS or MRS to minimize differences and improve downstream classification tasks. Lastly, we demonstrate the applicability of the proposed methods in a real-world study from resilience research, exploring the influence of resilience on voting behavior. Through this work, we address the issue of bias in the social sciences, among other fields, and provide a versatile, machine-learning-based methodology for bias reduction. Based on our experiments, we recommend using MRS for downstream classification tasks and Soft-MRS for downstream tasks where the relative bias of the dependent variable is relevant.
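The MRS loop described above can be sketched on a one-dimensional toy example. This is an assumption-laden illustration, not the paper's method: a simple distance-to-the-reference-mean score stands in for the trained classifier, and alignment is checked by comparing means.

```python
# Hypothetical sketch of the MRS loop on a 1-D toy example. A simple
# score stands in for the trained classifier: it flags the biased-set
# instance that most distinguishes it from the representative sample.
# Flagged instances receive weight 0 ("removed"); the loop stops once
# the weighted biased sample roughly aligns with the reference.

def mrs(biased: list[float], reference: list[float],
        tol: float = 0.1, max_iter: int = 100) -> list[int]:
    """Return 0/1 sample weights for the biased data set."""
    weights = [1] * len(biased)
    ref_mean = sum(reference) / len(reference)
    for _ in range(max_iter):
        kept = [x for x, w in zip(biased, weights) if w == 1]
        if abs(sum(kept) / len(kept) - ref_mean) <= tol:
            break
        # Stand-in for the classifier: drop the kept instance that most
        # separates the biased sample from the representative one.
        idx = max((i for i, w in enumerate(weights) if w == 1),
                  key=lambda i: abs(biased[i] - ref_mean))
        weights[idx] = 0
    return weights

biased = [1.0, 1.2, 0.9, 5.0, 6.0]   # two outlying instances induce bias
reference = [1.0, 1.1, 0.9, 1.2]
print(mrs(biased, reference))  # the outliers receive weight 0
```

Soft-MRS would shrink the weights of such instances toward 0 gradually rather than zeroing them outright, which preserves more of the sample.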